
Create a new method to return the final state vector array instead of wrapping it #623

Merged: 37 commits merged into quantumlib:master from the np32 branch on Oct 18, 2023

Conversation

@NoureldinYosri (Collaborator) commented Sep 21, 2023

This is to avoid the numpy limit on the number of array dimensions (quantumlib/Cirq#6031).

The 1D representation should only be used when the number of qubits is greater than the numpy limit on the number of dimensions, which is currently 32 (numpy/numpy#5744).

_, state_vector, _ = s.simulate_into_1d_array(c)

fixes quantumlib/Cirq#6031
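
A minimal usage sketch (assuming qsimcirq built from this branch; the circuit and variable names are illustrative):

import cirq
import qsimcirq

# Small circuit for illustration; the same call works above 32 qubits,
# where wrapping the result can fail because numpy caps arrays at 32 dimensions.
qubits = cirq.LineQubit.range(4)
circuit = cirq.Circuit(cirq.H(q) for q in qubits)

sim = qsimcirq.QSimSimulator()
# Returns (param_resolver, final_state_vector, qubit_order) without wrapping
# the state in a StateVectorTrialResult.
params, state_vector, qubit_order = sim.simulate_into_1d_array(circuit)
print(state_vector.shape)  # (16,) -- always a flat 1D array of 2**n amplitudes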

@NoureldinYosri NoureldinYosri marked this pull request as ready for review September 22, 2023 19:35
Review thread on qsimcirq/qsim_simulator.py (outdated, resolved):
def simulate_sweep_iter(
self,
program: cirq.Circuit,
params: cirq.Sweepable,
qubit_order: cirq.QubitOrderOrList = cirq.QubitOrder.DEFAULT,
initial_state: Optional[Union[int, np.ndarray]] = None,
as_1d_state_vector: bool = False,
Collaborator (reviewer):

This violates the API defined by cirq.SimulatesFinalState.simulate_sweep_iter. If there are plans to modify that function as well, please link the relevant Cirq PR. (The required cirq version for qsim will also need to be updated if this is the case.)

@NoureldinYosri (Collaborator, PR author):

It doesn't violate the API; it only changes the internal data representation, and only when as_1d_state_vector=True, which defaults to False, so nothing changes unless the caller explicitly requests the 1D representation. This change applies only to the qsim simulator.

The 1D representation is used for only one reason: to report the result. The normal representation is a tensor whose number of dimensions equals num_qubits, which breaks when the number of qubits exceeds the limit on numpy array dimensions (see the issue in the docstring and the PR description).
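
For illustration, a minimal numpy-only sketch of that limitation (names are arbitrary):

import numpy as np

n = 4
flat = np.zeros(2**n, dtype=np.complex64)   # 1D state vector: fine for any n that fits in memory
tensor = flat.reshape((2,) * n)             # per-qubit tensor: ndim == n
# For n > 32 the reshape above raises a ValueError, because numpy arrays
# support at most 32 dimensions -- hence the optional flat representation.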

Collaborator (reviewer):

QSimSimulator inherits from cirq.SimulatesFinalState, whose simulate_sweep_iter method does not have this new argument. Even though the implementation here can accept any valid input to the function of the parent class, it's still a violation of the API.

@NoureldinYosri (Collaborator, PR author):

I changed the approach: the PR now adds a new method rather than extending the existing API.

@NoureldinYosri (Collaborator, PR author):

@95-martin-orion This PR is just a workaround for quantumlib/Cirq#6031 until numpy starts to support more than 32 dimensions (numpy/numpy#5744).

@NoureldinYosri NoureldinYosri changed the title allow representing simulation results as 1D array Create a new method to return the final state vector array instead of wrapping it Sep 22, 2023
@95-martin-orion (Collaborator) left a comment:

Some docstring requests, otherwise this LGTM. A new qsimcirq version is necessary to make this generally available - would you like me to cut a new release?

Review thread on qsimcirq/qsim_simulator.py (outdated, resolved).
@NoureldinYosri (Collaborator, PR author):

@95-martin-orion

> Some docstring requests, otherwise this LGTM. A new qsimcirq version is necessary to make this generally available - would you like me to cut a new release?

yes, please 😄

@95-martin-orion 95-martin-orion added the kokoro:run Trigger Kokoro builds for this PR. label Sep 25, 2023
@qsim-qsimh-bot qsim-qsimh-bot removed the kokoro:run Trigger Kokoro builds for this PR. label Sep 25, 2023
@qsim-qsimh-bot qsim-qsimh-bot removed the kokoro:run Trigger Kokoro builds for this PR. label Oct 3, 2023
@95-martin-orion (Collaborator):

Thank you for the myriad fixes, @NoureldinYosri !

Logs for the Kokoro error can be found here. I unfortunately don't have much context on this, though I do know that the Kokoro tests are not affected by the bazeltest.yml file.

@NoureldinYosri (Collaborator, PR author) commented Oct 3, 2023:

@95-martin-orion From the logs:

WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found

It tries to download an old version of the TF runtime that no longer exists: https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz

The versions that are still hosted on storage.googleapis.com/mirror.tensorflow.org are listed at http://mirror.tensorflow.org/. Where does it decide to go for that specific version of the runtime?


Looking deeper in the logs, it looks like it bypasses that error and picks up a CUDA 11 environment, but then decides to look for CUDA 12:

ERROR: An error occurred during the fetch of repository 'ubuntu20.04-gcc9_manylinux2014-cuda11.2-cudnn8.1-tensorrt7.2_config_cuda':
...
No library found under: /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcupti.so.12.2

This looks to be the real problem.

@95-martin-orion 95-martin-orion added the kokoro:run Trigger Kokoro builds for this PR. label Oct 4, 2023
@95-martin-orion (Collaborator):

> {...} Where does it decide to go for that specific version of the runtime?

The files for this are stored in Google-internal repositories - I'll email you the links.

@qsim-qsimh-bot qsim-qsimh-bot removed the kokoro:run Trigger Kokoro builds for this PR. label Oct 4, 2023
@NoureldinYosri NoureldinYosri added the kokoro:run Trigger Kokoro builds for this PR. label Oct 18, 2023
@qsim-qsimh-bot qsim-qsimh-bot removed the kokoro:run Trigger Kokoro builds for this PR. label Oct 18, 2023
@NoureldinYosri NoureldinYosri added the kokoro:run Trigger Kokoro builds for this PR. label Oct 18, 2023
@NoureldinYosri NoureldinYosri merged commit 90d2707 into quantumlib:master Oct 18, 2023
15 checks passed
@NoureldinYosri NoureldinYosri deleted the np32 branch October 18, 2023 16:45
@rht commented Oct 28, 2023:

@NoureldinYosri thank you once again for this feature. May I know the timeline for the next release for qsim?

@95-martin-orion (Collaborator):

@rht qsim releases are on an "as-needed" basis, which I think this qualifies for. I've opened #631 to cut the release.

@95-martin-orion (Collaborator):

@rht A new release has been cut and should be visible on PyPI in the next 10-20 minutes.

@rht commented Oct 30, 2023:

I see, thank you. Just in time to do a huge statevector for Halloween!

@rht commented Feb 8, 2024:

@NoureldinYosri there was a delay in using this feature in our production instances. We were waiting for the cuQuantum Appliance to ship qsimcirq>=0.17.x (NVIDIA/cuQuantum#98), but that hasn't happened.

But I was able to test this PR by patching it directly onto qsimcirq 0.15.0 on cuQuantum Appliance 23.10. I am running a 2xA100 instance with the following code:

import time

from memory_profiler import memory_usage
import cirq
import qsimcirq

def f():
    num_qubits = 33
    qc_cirq = cirq.Circuit()
    qubits = cirq.LineQubit.range(num_qubits)
    for i in range(num_qubits):
        qc_cirq.append(cirq.H(qubits[i]))
    sim = qsimcirq.QSimSimulator()
    tic = time.time()
    # sim = cirq.Simulator()
    print("?", sim.simulate_into_1d_array)
    sim.simulate_into_1d_array(qc_cirq)
    print("Elapsed", time.time() - tic)
# print("Max memory", max(memory_usage(f)))
f()

but still got this OOM error

? <bound method QSimSimulator.simulate_into_1d_array of <qsimcirq.qsim_simulator.QSimSimulator object at 0x7f9e9ac62770>>
CUDA error: out of memory vector_mgpu.h 116

Here is the benchmark result for 32 qubits (haven't measured GPU memory usage from nvidia-smi yet)

Elapsed 14.033143758773804
Max memory 34182.4296875

Here is the manual patch I applied

535c535
<     def simulate_sweep_iter(
---
>     def _simulate_impl(
541,570c541
<     ) -> Iterator[cirq.StateVectorTrialResult]:
<         """Simulates the supplied Circuit.
< 
<         This method returns a result which allows access to the entire
<         wave function. In contrast to simulate, this allows for sweeping
<         over different parameter values.
< 
<         Avoid using this method with `use_gpu=True` in the simulator options;
<         when used with GPU this method must copy state from device to host memory
<         multiple times, which can be very slow. This issue is not present in
<         `simulate_expectation_values_sweep`.
< 
<         Args:
<             program: The circuit to simulate.
<             params: Parameters to run with the program.
<             qubit_order: Determines the canonical ordering of the qubits. This is
<               often used in specifying the initial state, i.e. the ordering of the
<               computational basis states.
<             initial_state: The initial state for the simulation. This can either
<               be an integer representing a pure state (e.g. 11010) or a numpy
<               array containing the full state vector. If none is provided, this
<               is assumed to be the all-zeros state.
< 
<         Returns:
<             List of SimulationTrialResults for this run, one for each
<             possible parameter resolver.
< 
<         Raises:
<             TypeError: if an invalid initial_state is provided.
<         """
---
>     ) -> Iterator[Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]]:
625a597,649
>             yield prs, qsim_state.view(np.complex64), cirq_order
> 
>     def simulate_into_1d_array(
>         self,
>         program: cirq.AbstractCircuit,
>         param_resolver: cirq.ParamResolverOrSimilarType = None,
>         qubit_order: cirq.QubitOrderOrList = cirq.ops.QubitOrder.DEFAULT,
>         initial_state: Any = None,
>     ) -> Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]:
>         """Same as simulate() but returns raw simulation result without wrapping it.
>             The returned result is not wrapped in a StateVectorTrialResult but can be used
>             to create a StateVectorTrialResult.
>         Returns:
>             Tuple of (param resolver, final state, qubit order)
>         """
>         params = cirq.study.ParamResolver(param_resolver)
>         return next(self._simulate_impl(program, params, qubit_order, initial_state))
> 
>     def simulate_sweep_iter(
>         self,
>         program: cirq.Circuit,
>         params: cirq.Sweepable,
>         qubit_order: cirq.QubitOrderOrList = cirq.QubitOrder.DEFAULT,
>         initial_state: Optional[Union[int, np.ndarray]] = None,
>     ) -> Iterator[cirq.StateVectorTrialResult]:
>         """Simulates the supplied Circuit.
>         This method returns a result which allows access to the entire
>         wave function. In contrast to simulate, this allows for sweeping
>         over different parameter values.
>         Avoid using this method with `use_gpu=True` in the simulator options;
>         when used with GPU this method must copy state from device to host memory
>         multiple times, which can be very slow. This issue is not present in
>         `simulate_expectation_values_sweep`.
>         Args:
>             program: The circuit to simulate.
>             params: Parameters to run with the program.
>             qubit_order: Determines the canonical ordering of the qubits. This is
>               often used in specifying the initial state, i.e. the ordering of the
>               computational basis states.
>             initial_state: The initial state for the simulation. This can either
>               be an integer representing a pure state (e.g. 11010) or a numpy
>               array containing the full state vector. If none is provided, this
>               is assumed to be the all-zeros state.
>         Returns:
>             Iterator over SimulationTrialResults for this run, one for each
>             possible parameter resolver.
>         Raises:
>             TypeError: if an invalid initial_state is provided.
>         """
> 
>         for prs, state_vector, cirq_order in self._simulate_impl(
>             program, params, qubit_order, initial_state
>         ):
627c651
<                 initial_state=qsim_state.view(np.complex64), qubits=cirq_order
---
>                 initial_state=np.complex64, qubits=cirq_order

@rht commented Feb 8, 2024:

Something is still consuming much more GPU memory than in the past. I used to be able to do 33 qubits on a 2xA100 instance.

$ nvidia-smi
Thu Feb  8 00:07:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    63W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0    61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@NoureldinYosri (Collaborator, PR author) commented Feb 8, 2024:

For 32 qubits we have a state vector of $2^{32}$ complex entries, each of which is 2 float32 numbers, or 8 bytes, so we should expect usage of at least $2^{35}$ bytes, or 32 GB. The value in #623 (comment) is 34182.4296875 MB, or $\approx 34.1$ GB, so we are only using about $2.1$ GB more than the minimum necessary, which I suppose is consumed by numpy overhead, other variables, and perhaps auxiliary variables that will eventually be cleaned up by the garbage collector.


Are you sure you could do 33 qubits on this machine? The same calculation gives $2^{36}$ bytes, or 64 GB, for 33 qubits. Per https://www.aime.info/en/shop/product/aime-gpu-cloud-v242xa100/?pid=V28-2XA100-D1, a 2xA100 instance has only 40 GB of RAM per GPU.
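
A quick arithmetic check of these estimates (illustrative only):

# Minimum state-vector size: 2**n amplitudes x 8 bytes (complex64 = 2 x float32).
for n in (32, 33):
    state_bytes = 2**n * 8
    print(n, state_bytes, state_bytes / 2**30)   # 32 -> 32 GiB, 33 -> 64 GiB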

@rht commented Feb 8, 2024:

> Are you sure you could do 33 qubits on this machine?

Yes, we are able to do so on qsimcirq==0.12.1 via the cuQuantum Appliance, which has a multi-GPU backend. Hence, 2x40 GB is more than enough for the 64 GB requirement of 33 qubits.

I am in the process of measuring the max GPU memory consumption by polling nvidia-smi in the background while the simulation is running, but this will take a while since I have terminated the instance and will have to wait until there is an open slot for the 2xA100 instance.

@rht commented Feb 8, 2024:

Update: all is good! I am able to run 33 qubits on the 2xA100 instance. I confirm this PR works.

The bug in the code in #623 (comment) was that I forgot to specify

    options = qsimcirq.QSimOptions(gpu_mode=2)
    sim = qsimcirq.QSimSimulator(options)
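
For reference, a sketch of the corrected test under those assumptions (cuQuantum Appliance build of qsimcirq, where gpu_mode selects the multi-GPU backend as in the snippet above; function and variable names are illustrative):

import time

import cirq
import qsimcirq

def run(num_qubits: int) -> None:
    qubits = cirq.LineQubit.range(num_qubits)
    circuit = cirq.Circuit(cirq.H(q) for q in qubits)
    options = qsimcirq.QSimOptions(gpu_mode=2)   # 2-GPU backend, per the fix above
    sim = qsimcirq.QSimSimulator(options)
    tic = time.time()
    _, state_vector, _ = sim.simulate_into_1d_array(circuit)
    print("num_qubits", num_qubits, "Elapsed", time.time() - tic)

run(33)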

My measurements (I'm not sure why the GPU memory is that low, but anyway, it works):

CPU only:
num_qubits 32
Elapsed 114.16660451889038
Peak GPU memory usage: 3 MiB
Max CPU memory 33086.91015625

GPU:
num_qubits 31
Elapsed 14.6939697265625
Peak GPU memory usage: 425 MiB
Max CPU memory 16830.81640625

num_qubits 32
Elapsed 28.458886861801147
Peak GPU memory usage: 425 MiB
Max CPU memory 33174.0078125

num_qubits 33
Elapsed 17.026336431503296
Peak GPU memory usage: 853 MiB
Max CPU memory 67345.63671875

The GPU memory is measured by reading the output of nvidia-smi --query-gpu=memory.used --format=csv.
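
A sketch of that measurement loop (assuming nvidia-smi is on PATH; the polling interval and bookkeeping are illustrative):

import subprocess
import threading
import time

peak_mib = 0
done = threading.Event()

def poll_gpu_memory(interval: float = 0.01) -> None:
    global peak_mib
    while not done.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout
        used = sum(int(x) for x in out.split())   # MiB, summed over all GPUs
        peak_mib = max(peak_mib, used)
        time.sleep(interval)

poller = threading.Thread(target=poll_gpu_memory, daemon=True)
poller.start()
# ... run the simulation here ...
done.set()
poller.join()
print("Peak GPU memory usage:", peak_mib, "MiB")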

@rht commented Feb 8, 2024:

> (I'm not sure why the GPU memory is that low, but anyway, it works)

My guess is that the time spent on the GPU is somewhat shorter than the interval at which nvidia-smi measures the VRAM (0.01 s).

Linked issue closed by this PR: cirq.sample_state_vector fails when the number of qubits > 32 (quantumlib/Cirq#6031)